Anomaly Detection Analysis

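  • What is anomaly detection

    Definition: isolated points, outliers, values that stand out, fall out of range, or are unexpected...
    Causes: measurement error, recording error, random variation, or other non-random factors...
    Applications: fault detection, network intrusion detection, fraud analysis, medicine and pharmaceuticals...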

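  • Anomalous vs. normal

    One person's anomaly may be another person's signal. Inference comes from sampling, so apply it with caution.
    For example, the difference between an athletic young person and an ordinary one:
    Kobe Bryant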

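  • Anomaly detection methods

    Statistical: non-randomness
    Density-based: look at the neighbors
    Clustering-based: belongs to no cluster
    Other: distance, sequence, deviation...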

  • Statistical anomaly detection
    The 3σ rule: low-probability events
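
As a quick numerical check of the 3σ rule, the sketch below computes how much of a normal distribution lies more than three standard deviations from the mean, and how many such points one would expect in a sample of 1000. It uses only base R; the variable names are illustrative and do not come from the notebook.

```r
# Probability that a normally distributed value lies more than
# 3 standard deviations away from the mean (both tails together)
p_out <- 2 * pnorm(-3)

# Expected number of such points in a sample of 1000 (assumed sample size)
n_sample <- 1000
expected_out <- n_sample * p_out

print(p_out)
print(expected_out)
```

The probability is roughly 0.27%, which is why a point beyond 3σ is treated as a low-probability event under the normality assumption.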
In [2]:
# Generate random data
n=1000;sdout=3   # sample size; outlier threshold in standard deviations
x=rnorm(n,mean=0,sd=1)
y=rnorm(n,mean=5,sd=2)

Look at the distribution

In [3]:
par(mfrow = c(1, 2), ann = FALSE)
options(repr.plot.width=8, repr.plot.height=4)
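# 2-D scatter plot
plot(x,y,pch=19,cex=1)
# flag points more than sdout standard deviations from the mean on either axis
oindx=abs(x-0)>sdout*1;oindy=abs(y-5)>sdout*2
oindxy=oindx | oindy  
points(x[oindxy],y[oindxy],pch='+',col='red',cex=2)
# 1-D distribution of x
ovalx=x[oindx]
hist(x,xlim=range(x))
# normal density curve, scaled by hand (450) to roughly match the histogram counts
lines(sort(x),dnorm(sort(x))*450)
# mark the outliers on the curve, using the same scale factor
points(ovalx,dnorm(ovalx)*450,pch='+',col='red',cex=2)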

A box plot is more intuitive

In [4]:
# Box plot (median, quartiles, outliers)
options(repr.plot.width=6, repr.plot.height=5)
boxplot(x,y)
bx=boxplot.stats(x)
#for (i in bx$out) points(i,pch='+',col='red',cex=1)
# outliers of x in red
points(rep(1,length(bx$out)),bx$out,pch='x',col='red',cex=1)
# the five box plot statistics of x in blue
points(rep(1,5),bx$stats,pch='x',col='blue',cex=1)

What does boxplot draw?

In [5]:
# Upper and lower edges of the box
bx$stats
fivenum(x)
t(quantile(x,c(25,50,75)/100))
fivenum
  1. -2.53857725326263
  2. -0.641645935271929
  3. -0.0152562817730275
  4. 0.641871186139328
  5. 2.50415487119219
  1. -3.31439609110045
  2. -0.641645935271929
  3. -0.0152562817730275
  4. 0.641871186139328
  5. 3.50446082429514
        25%         50%         75%
-0.64149298 -0.01525628  0.64180334
function (x, na.rm = TRUE) 
{
    xna <- is.na(x)
    if (any(xna)) {
        if (na.rm) 
            x <- x[!xna]
        else return(rep.int(NA, 5))
    }
    x <- sort(x)
    n <- length(x)
    if (n == 0) 
        rep.int(NA, 5)
    else {
        n4 <- floor((n + 3)/2)/2
        d <- c(1, n4, (n + 1)/2, n + 1 - n4, n)
        0.5 * (x[floor(d)] + x[ceiling(d)])
    }
}
In [6]:
# Upper and lower whisker limits (see help(boxplot.stats), the coef argument)
#http://r.789695.n4.nabble.com/Whiskers-on-the-default-boxplot-graphics-td2195503.html
IQR(x)
quantile(x,75/100)-quantile(x,25/100)
quantile(x,75/100)+1.5*IQR(x)
bx$stats
sort( x[x>2    & (x< (quantile(x,75/100)+1.5*IQR(x)))] )
sort( x[x<(-2) & (x> (quantile(x,25/100)-1.5*IQR(x)))] )
1.28329632842284
75%: 1.28329632842284
75%: 2.5667478371699
  1. -2.53857725326263
  2. -0.641645935271929
  3. -0.0152562817730275
  4. 0.641871186139328
  5. 2.50415487119219
  1. 2.00857659605993
  2. 2.01330498360655
  3. 2.09221664548117
  4. 2.13526474314856
  5. 2.15912792962683
  6. 2.20069558876359
  7. 2.20878639074813
  8. 2.23541743022953
  9. 2.28096163534843
  10. 2.32318162148963
  11. 2.34287221249508
  12. 2.3680779512695
  13. 2.39131522962954
  14. 2.41183200558629
  15. 2.50415487119219
  1. -2.53857725326263
  2. -2.43281737821378
  3. -2.23740634035229
  4. -2.2124992242306
  5. -2.21064098158478
  6. -2.19413505972392
  7. -2.14470275666434
  8. -2.09722329485777
  9. -2.06635396705753
  10. -2.0430688910097
  11. -2.03473623591339
  12. -2.00429155645384
  13. -2.00393313133406
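
The outputs above can be tied together with a small self-contained sketch. It recomputes the whisker ends and the outliers by hand from `fivenum` and compares them with `boxplot.stats`, assuming the default `coef = 1.5`. The seed, the sample `v`, and the other variable names are assumptions made for illustration, so the numbers will differ from the notebook's.

```r
set.seed(1)                        # assumed seed, only for reproducibility
v <- rnorm(1000)

h <- fivenum(v)                    # min, lower hinge, median, upper hinge, max
box_h <- h[4] - h[2]               # height of the box (spread between the hinges)
lower_fence <- h[2] - 1.5 * box_h
upper_fence <- h[4] + 1.5 * box_h

inside <- v >= lower_fence & v <= upper_fence
whiskers <- range(v[inside])       # whiskers end at the most extreme points inside the fences
outliers <- v[!inside]             # points beyond the fences are reported as outliers

bs <- boxplot.stats(v)
print(all.equal(whiskers, bs$stats[c(1, 5)]))
print(all.equal(sort(outliers), sort(bs$out)))
```

Note that `boxplot.stats` works with the hinges returned by `fivenum`, which can differ slightly from `quantile(x, c(0.25, 0.75))`, as the outputs of In [5] show.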

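Since this is statistics and relies on a distribution, what happens when reality (the data) does not match the ideal (the assumption)...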

In [7]:
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow = c(1, 2), ann = FALSE)
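# left panel: histogram of a bimodal sample
x2=c(rnorm(n,mean=0,sd=1),rnorm(n/3,mean=8,sd=2))
xh=hist(x2,breaks=seq(range(x2)[1]-1,range(x2)[2]+1,0.5))
# right panel: the same histogram, with the sparse region between the modes marked in red
xh=hist(x2,breaks=seq(range(x2)[1]-1,range(x2)[2]+1,0.5))
lines(range(xh$mids[xh$density<0.01 & xh$mids <5 & xh$mids>1]),c(2,2),col='red',lwd=10 )
# overlay a horizontal box plot on the same panel
par(new=TRUE)
boxplot(x2,horizontal = TRUE,col='grey')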

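Judging by the box plot, the data in the red region falls within the upper and lower whisker limits, so it counts as normal even though it sits in the sparse gap between the two modes.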
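
The same point can be checked numerically. The sketch below builds a bimodal sample with the same shape as `x2` and counts how many points in the sparse gap are reported as outliers by `boxplot.stats`. The seed, the gap boundaries (2 to 4), and the variable names are assumptions chosen for illustration.

```r
set.seed(2)                                  # assumed seed for reproducibility
m <- c(rnorm(1000, mean = 0, sd = 1),        # main mode
       rnorm(333,  mean = 8, sd = 2))        # second, smaller mode

bs <- boxplot.stats(m)
gap <- m[m > 2 & m < 4]                      # sparse region between the two modes (assumed bounds)
gap_flagged <- gap[gap %in% bs$out]          # gap points the box plot reports as outliers

cat("points in the gap:", length(gap), "\n")
cat("flagged by the box plot:", length(gap_flagged), "\n")
```

Because the second mode stretches the upper quartile and the box height, the upper fence ends up far to the right of the gap, so none of these points is flagged.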

  • Density-based anomaly detection
    Compare with nearby points: sparse means anomalous
In [8]:
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow = c(1, 2), ann = FALSE)
library(DMwR)
# left panel: raw scatter plot
y2=c(rnorm(n,mean=0,sd=4),rnorm(n/3,mean=6,sd=3))
dfxy=data.frame(x2=x2,y2=y2)
plot(dfxy,pch=19,cex=1)
# right panel: the 10 highest LOF scores marked in red (try changing k)
plot(dfxy,pch=19,cex=1)
score_dfxy=lofactor(dfxy, k = 5)
out_dfxy=order(score_dfxy, decreasing = TRUE)[1:10]
points(dfxy[out_dfxy,],col='red',pch='x',cex=1.2)
Loading required package: lattice
Loading required package: grid
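
To make the "sparse means anomalous" idea concrete without the DMwR package, here is a minimal base-R sketch. It is not LOF: it uses a simpler stand-in score, the mean distance from each point to its k nearest neighbors, so that a large score means a sparse neighborhood. The data, the seed, and all names are assumptions made for illustration.

```r
set.seed(42)                                   # assumed seed for reproducibility
pts <- rbind(matrix(rnorm(200), ncol = 2),     # 100 points clustered around the origin
             c(8, 8))                          # one isolated point, stored as row 101

d <- as.matrix(dist(pts))                      # pairwise Euclidean distances
k <- 5

# Mean distance to the k nearest neighbors; the first sorted value
# is the zero distance of a point to itself, so it is skipped
knn_score <- apply(d, 1, function(row) mean(sort(row)[2:(k + 1)]))

most_isolated <- unname(which.max(knn_score))
print(most_isolated)
```

LOF goes one step further by comparing each point's local density with the densities of its neighbors, which makes it more robust when clusters have different densities.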
  • Clustering-based anomaly detection
    Small groups, misfits
In [9]:
dfxy$cc=NULL

kmeans

In [29]:
options(repr.plot.width=6, repr.plot.height=5)
nc=5
cc=kmeans(dfxy,nc)
# color each point by its assigned cluster
plot(y2~x2,col=cc$cluster,dfxy,cex=2,pch = ".")
# cluster centers
points(cc$centers, col = 1:nc, pch = "o", cex = 2)  
# outliers: the 20 points farthest from their own cluster center
centers=cc$centers[cc$cluster,]
dists=sqrt(rowSums((dfxy-centers)^2))
outers=order(dists,decreasing = TRUE)[1:20]
points(dfxy[outers,c('x2','y2')], col = 9, pch = "+", cex = 1)

Hierarchical Clustering

In [30]:
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow = c(1, 2), ann = FALSE)
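# average-linkage hierarchical clustering on Euclidean distances
hc = hclust(dist(dfxy), "average")
plot(hc)
# cut the tree into 5 clusters and color the points by cluster
ccout = cutree(hc, k = 5)
plot(y2~x2,col=ccout,dfxy)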
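
One way to turn a hierarchical clustering into an outlier list, following the "small groups, misfits" idea, is to flag the members of very small clusters. The sketch below does this on toy data; the data, the seed, the number of clusters, and the `min_size` threshold are all assumptions for illustration, not part of the notebook.

```r
set.seed(3)                                    # assumed seed for reproducibility
blob <- matrix(rnorm(100), ncol = 2)           # 50 points around the origin
far  <- rbind(c(10, 10), c(10.2, 10))          # 2 points far from the rest (rows 51 and 52)
pts  <- rbind(blob, far)

hc_demo <- hclust(dist(pts), method = "average")
grp <- cutree(hc_demo, k = 2)                  # cluster label for each point
sizes <- table(grp)                            # number of members per cluster

min_size <- 5                                  # assumed threshold for "too small to be normal"
small <- as.integer(names(sizes)[sizes < min_size])
suspects <- which(grp %in% small)              # row indices of points in small clusters
print(suspects)
```

The threshold and the number of clusters both have to be chosen for the data at hand; on the notebook's `dfxy` a reasonable starting point would be to inspect `table(ccout)`.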
  • Sequence anomalies
In [31]:
library(quantmod)
options("getSymbols.warning4.0"=FALSE)
setSymbolLookup(TL=list(name="000025.sz",src="yahoo"))
getSymbols("TL")
candleChart(TL,theme='white', type='candles')
Loading required package: xts
Loading required package: zoo

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

Loading required package: TTR
Version 0.4-0 included new data defaults. See ?getSymbols.
"TL"

hello R! We can plotly U !
